An analysis of the Covid-19 Pandemic

Marco Huang(0201), Jingyun Li(0101) Medical-Illustration-2.jpeg

Introduction

COVID-19 is the disease caused by SARS-CoV-2, the coronavirus that emerged in December 2019. COVID-19 can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness. The coronavirus can be spread from person to person. It is diagnosed with a test.

Three years after the break out of the corona virus, the growth of the COVID confirmation rate seems to slow down, providing us with the best timing to examine the pandemic as a whole. Here in part one, we would like to look at COVID in the United States, and several representative states in specific, and discuss what the data illustrates to us. For part two, we're going to look at the relation between confirmation cases and housing prices.

Part 1: About COVID-19

1. Data Collection

The data collection stage is very important. Without proper data to work with, no analysis can be done. Make sure to find credible and recent data to create accurate models and analysis.

In this project the Covid-19 data we used comes from Johns Hopkins University and is available at this link: https://github.com/CSSEGISandData/COVID-19

1.1 Tool used

We used the following tools to collect this data: pandas, numpy, matplotlib, scikit-learn, seaborn, os, folium, and more.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn
import warnings
import os
import folium
import scipy.stats as stats
from statsmodels.formula.api import ols as o
from sklearn import linear_model
import re
warnings.filterwarnings('ignore')

1.2 Data processing

1.2.2 US overall

We want to first look at the overall confirmed and death cases in the US. Here we read the data of the confirmed covid cases throughout the whole world from 1/22/20 till today. For this project, We would focus on the united states.

Below is the global confirmation data. It includes all the countries, their latitude, longitude, and the daily cumulative confirmation among all these countries.

In [2]:
world_conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", sep=',')
world_conf.head()
Out[2]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 206331 206414 206465 206504 206543 206603 206702 206743 206788 206879
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 333455 333472 333490 333491 333521 333533 333535 333567 333591 333613
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 271122 271128 271135 271140 271146 271146 271147 271149 271156 271156
3 NaN Andorra 42.50630 1.521800 0 0 0 0 0 0 ... 47219 47446 47446 47446 47446 47446 47446 47446 47606 47606
4 NaN Angola -11.20270 17.873900 0 0 0 0 0 0 ... 104750 104808 104808 104808 104808 104808 104808 104808 104946 104946

5 rows × 1063 columns

We first extract the confirmation data for the US from the world data frame. We then calculated the increase of confirmation covid for every day in between and transposed it afterward.

In [3]:
us_conf = pd.melt(world_conf, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='conf_cases')
us_conf = us_conf.drop(columns=['Province/State', 'Lat', 'Long'])
us_conf = us_conf.rename(columns={'Country/Region': 'Country'})
us_conf["Date"] = pd.to_datetime(us_conf['Date'])
us_conf = us_conf.groupby(['Country', 'Date']).sum()
us_conf["Next_day"] = us_conf['conf_cases'].shift(fill_value=0)
us_conf["conf_change"]= us_conf['conf_cases'] - us_conf['Next_day']
us_conf = us_conf.drop(columns=['Next_day'])
us_conf = us_conf.reset_index()
us_conf = us_conf[us_conf["conf_change"] >= 0] 
us_conf = us_conf[us_conf["Country"]=="US"]
us_conf = us_conf.set_index("Date")
us_conf = us_conf.drop(columns=['Country'])
us_conf.head()
Out[3]:
conf_cases conf_change
Date
2020-01-23 1 0
2020-01-24 2 1
2020-01-25 2 0
2020-01-26 5 3
2020-01-27 5 0

Below is the global death data. It includes all the countries, their latitude, longitude, and the daily cumulative death among all these countries.

In [4]:
world_death = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", sep=',')
world_death.head()
Out[4]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 NaN Afghanistan 33.93911 67.709953 0 0 0 0 0 0 ... 7835 7837 7839 7839 7839 7839 7840 7843 7843 7843
1 NaN Albania 41.15330 20.168300 0 0 0 0 0 0 ... 3594 3594 3594 3594 3594 3594 3594 3594 3594 3594
2 NaN Algeria 28.03390 1.659600 0 0 0 0 0 0 ... 6881 6881 6881 6881 6881 6881 6881 6881 6881 6881
3 NaN Andorra 42.50630 1.521800 0 0 0 0 0 0 ... 157 158 158 158 158 158 158 158 158 158
4 NaN Angola -11.20270 17.873900 0 0 0 0 0 0 ... 1925 1925 1925 1925 1925 1925 1925 1925 1928 1928

5 rows × 1063 columns

We did the same thing for the death data -- we calculated the increase of death cases for every day in between and transposed it afterward.

In [5]:
us_death = pd.melt(world_death, ['Province/State','Country/Region', 'Lat', 'Long'], var_name="Date", value_name='death_cases')
us_death = us_death.drop(columns=['Province/State', 'Lat', 'Long'])
us_death = us_death.rename(columns={'Country/Region': 'Country'})
us_death["Date"] = pd.to_datetime(us_death['Date'])
us_death = us_death.groupby(['Country', 'Date']).sum()
us_death["Next_day"] = us_death['death_cases'].shift(fill_value=0)
us_death["death_change"]= us_death['death_cases'] - us_death['Next_day']
us_death = us_death.drop(columns=['Next_day'])
us_death = us_death.reset_index()
us_death = us_death[us_death["death_change"] >= 0] 
us_death = us_death[us_death["Country"]=="US"]
us_death = us_death.set_index("Date")
us_death = us_death.drop(columns=['Country'])
us_death.head()
Out[5]:
death_cases death_change
Date
2020-01-22 0 0
2020-01-23 0 0
2020-01-24 0 0
2020-01-25 0 0
2020-01-26 0 0

We then joined the tables into a data frame us_overall. The new data frame has confirmed cases, daily confirmed the change, death cases, and daily death change data all in one.

In [6]:
us_overall = us_conf.join(us_death, how='outer')
us_overall.head()
Out[6]:
conf_cases conf_change death_cases death_change
Date
2020-01-22 NaN NaN 0.0 0.0
2020-01-23 1.0 0.0 0.0 0.0
2020-01-24 2.0 1.0 0.0 0.0
2020-01-25 2.0 0.0 0.0 0.0
2020-01-26 5.0 3.0 0.0 0.0

1.2.2 US states

Below is the confirmed cases in each state of the United States. Here we would also want to sort out some states that is representative of a certain area. Below is the states we picked for this project, we selected one state for each of the 9 regions.

  • New England: Maine
  • Middle Atlantic: New York
  • East North Central: Wisconsin
  • West North Central: Kansas
  • South Atlantic: Maryland
  • East South Central: Alabama
  • West South Central: Texas
  • Mountain: Arizona
  • Pacific: California

We first read in the data from the Hopkins site.

In [7]:
conf = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", sep=',')
conf.head()
Out[7]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 18680 18752 18752 18752 18752 18752 18752 18752 18847 18847
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 66730 66951 66951 66951 66951 66951 66951 66951 67221 67221
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 6980 6989 6989 6989 6989 6989 6989 6989 7007 7007
3 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 7637 7653 7653 7653 7653 7653 7653 7653 7668 7668
4 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 17500 17559 17559 17559 17559 17559 17559 17559 17648 17648

5 rows × 1070 columns

We choose 9 state which locate the 9 area of US.

In [8]:
MD = conf[conf["Province_State"] == "Maryland"]
frames = [MD]
confirmed = MD.drop(conf.columns[0:11], axis=1)
confirmed = confirmed.append(confirmed.sum(numeric_only=True), ignore_index=True)
confirmed.drop(confirmed.index[0:26], inplace=True)
list1 = ["Maine", "New York", "Wisconsin", "Kansas", "Alabama", "Texas", "Arizona", "California"]
for x in list1:
  state = conf[conf["Province_State"] == x]
  frames.append(state)
  time = state.drop(state.columns[0:11], axis=1)
  sum = time.append(time.sum(numeric_only=True), ignore_index=True)
  confirmed = confirmed.append(sum.sum(numeric_only=True), ignore_index=True)
confirmed = confirmed.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
result = pd.concat(frames)
confirmed = confirmed.swapaxes("index", "columns")
confirmed.index = pd.to_datetime(confirmed.index)
confirmed.head()
Out[8]:
Maryland Maine New York Wisconsin Kansas Alabama Texas Arizona California
2020-01-22 0 0 0 0 0 0 0 0 0
2020-01-23 0 0 0 0 0 0 0 0 0
2020-01-24 0 0 0 0 0 0 0 0 0
2020-01-25 0 0 0 0 0 0 0 0 0
2020-01-26 0 0 0 0 0 0 0 2 4

Number of deaths by state in the US

In [9]:
us_dead = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv", sep=',')
us_dead.head()
Out[9]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22 12/12/22 12/13/22 12/14/22 12/15/22
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 230 230 230 230 230 230 230 230 230 230
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 716 717 717 717 717 717 717 717 717 717
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 103 103 103 103 103 103 103 103 103 103
3 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 108 108 108 108 108 108 108 108 108 108
4 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 259 260 260 260 260 260 260 260 260 260

5 rows × 1071 columns

The totle number of dead in 9 states.

In [10]:
MD2 = us_dead[us_dead["Province_State"] == "Maryland"]
death = MD2.drop(MD2.columns[0:12], axis=1)
death = death.append(death.sum(numeric_only=True), ignore_index=True)
death.drop(death.index[0:26], inplace=True)
for x in list1:
  state2 = us_dead[us_dead["Province_State"] == x]
  time2 = state2.drop(state2.columns[0:12], axis=1)
  sum2 = time2.append(time2.sum(numeric_only=True), ignore_index=True)
  death = death.append(sum2.sum(numeric_only=True), ignore_index=True)
death = death.rename(index={0: 'Maryland', 1: 'Maine', 2: 'New York', 3: 'Wisconsin', 4: 'Kansas', 5: 'Alabama', 6: 'Texas', 7: 'Arizona', 8: 'California'})
death = death.swapaxes("index", "columns")
death.index = pd.to_datetime(death.index)
death.head()
Out[10]:
Maryland Maine New York Wisconsin Kansas Alabama Texas Arizona California
2020-01-22 0 0 0 0 0 0 0 0 0
2020-01-23 0 0 0 0 0 0 0 0 0
2020-01-24 0 0 0 0 0 0 0 0 0
2020-01-25 0 0 0 0 0 0 0 0 0
2020-01-26 0 0 0 0 0 0 0 0 0

2. Data representation and analysis

2.1 Overall trend among all regions in the US

2.1.1 Confirmation trend in the US

In [11]:
us_overall.plot(y="conf_cases", legend=None)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d69d67f0>
In [12]:
us_overall.plot(y="conf_change")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d6707b50>

2.1.2 Deaths trend in the US

In [13]:
us_overall.plot(y="death_cases", legend=None)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d69d6640>
In [14]:
us_overall.plot(y="death_change")
# , figsize=(15,10)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d6132ee0>

2.2 Trend among all regions in the US

In [15]:
confirmed.plot()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d607ea30>

Cases in nine states continue to trend upward. It is obvious that the number of confirmed cases in June of 2022 will suddenly increase significantly. One possible reason is that people go on vacation during the summer vacation, which further increases the chance of contact.

In [16]:
death.plot()
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d5f164f0>

We then did the same thing to the 9 states selected as what we did to the global confirmed and death database-- we calculated the daily increase confirmation rate throught the nine selected states and transposed the dataframe.

In [17]:
result = result.drop(result.columns[[0,1,2,3,4,7,10]], axis=1)
result = pd.melt(result, ['Admin2','Province_State', 'Lat', 'Long_'], var_name="Date", value_name='Cases')
result = result.drop(columns=['Province_State'])
result = result.rename(columns={'Admin2': 'Admin', 'Long_': 'Long'})
result["Date"] = pd.to_datetime(result['Date'])
result = result.groupby(['Admin', 'Date']).sum()
result["Next_day"] = result['Cases'].shift(fill_value=0)
result["Daily_change"]= result['Cases'] - result['Next_day']
result = result.drop(columns=['Next_day'])
result = result.reset_index()
result = result[result["Daily_change"] >= 0]

Afterwards, we put them into the US map and represent the daily increase as a heat map.

In [18]:
result["Date"] = result["Date"].astype(str)
fig = px.scatter_geo(result, lat="Lat", lon="Long",
                     hover_name="Admin", size="Daily_change",size_max=80,
                     animation_frame="Date",
                     scope = "usa",
                     title = "Total Cases")
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100
fig.show()

3. Hypothesis testing

The mid point of covid is roughly in the middle of June. So we would set that as a middle point and conduct hypothesis testing for the data before the middle of June 2021 and that after.

$H_0$: the mean proportion of death before June 2021 is equal to that after June 2021

$H_\alpha$: the mean proportion of death before June 2021 is greater than that after June 2021

Here is a visualization of the covid death proportion.

In [19]:
us_overall['proportion'] = us_overall.apply(lambda row: row['death_change'] / row['conf_change'], axis=1)
us_overall.plot(y= 'proportion')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3d61f9700>
In [20]:
us_overall = us_overall.dropna()
us_overall
Out[20]:
conf_cases conf_change death_cases death_change proportion
Date
2020-01-24 2.0 1.0 0.0 0.0 0.000000
2020-01-26 5.0 3.0 0.0 0.0 0.000000
2020-01-29 6.0 1.0 0.0 0.0 0.000000
2020-01-31 8.0 2.0 0.0 0.0 0.000000
2020-02-03 11.0 3.0 0.0 0.0 0.000000
... ... ... ... ... ...
2022-12-11 99422778.0 4939.0 1084601.0 0.0 0.000000
2022-12-12 99479472.0 56694.0 1084813.0 212.0 0.003739
2022-12-13 99570282.0 90810.0 1085386.0 573.0 0.006310
2022-12-14 99716762.0 146480.0 1086302.0 916.0 0.006253
2022-12-15 99826698.0 109936.0 1087013.0 711.0 0.006467

1029 rows × 5 columns

We then want to extract the proportion data from both time lines.

In [21]:
first = us_overall.head(515)
first
Out[21]:
conf_cases conf_change death_cases death_change proportion
Date
2020-01-24 2.0 1.0 0.0 0.0 0.000000
2020-01-26 5.0 3.0 0.0 0.0 0.000000
2020-01-29 6.0 1.0 0.0 0.0 0.000000
2020-01-31 8.0 2.0 0.0 0.0 0.000000
2020-02-03 11.0 3.0 0.0 0.0 0.000000
... ... ... ... ... ...
2021-07-13 34032036.0 26476.0 603884.0 357.0 0.013484
2021-07-14 34063184.0 31148.0 604211.0 327.0 0.010498
2021-07-15 34095126.0 31942.0 604567.0 356.0 0.011145
2021-07-16 34173158.0 78032.0 605020.0 453.0 0.005805
2021-07-17 34187224.0 14066.0 605092.0 72.0 0.005119

515 rows × 5 columns

In [22]:
last = us_overall.tail(515)
last
Out[22]:
conf_cases conf_change death_cases death_change proportion
Date
2021-07-17 34187224.0 14066.0 605092.0 72.0 0.005119
2021-07-18 34207594.0 20370.0 605187.0 95.0 0.004664
2021-07-19 34256821.0 49227.0 605385.0 198.0 0.004022
2021-07-20 34296710.0 39889.0 605675.0 290.0 0.007270
2021-07-21 34344582.0 47872.0 606038.0 363.0 0.007583
... ... ... ... ... ...
2022-12-11 99422778.0 4939.0 1084601.0 0.0 0.000000
2022-12-12 99479472.0 56694.0 1084813.0 212.0 0.003739
2022-12-13 99570282.0 90810.0 1085386.0 573.0 0.006310
2022-12-14 99716762.0 146480.0 1086302.0 916.0 0.006253
2022-12-15 99826698.0 109936.0 1087013.0 711.0 0.006467

515 rows × 5 columns

In [23]:
stats.ttest_rel(first["proportion"], last["proportion"])
Out[23]:
Ttest_relResult(statistic=14.502805772619178, pvalue=3.3602135329649983e-40)

The result p-value for the paired t-test is roughly 3.36e-40, which is significantly less than the alpha value of 0.05. Therefore we could reject the null hypothesis that the mean proportion of death before June 2021 is equal to that after June 2021.

Part 2: COVID and housing price

1. Data Collection

  • New England: Maine
  • Middle Atlantic: New York
  • East North Central: Wisconsin
  • West North Central: Kansas
  • South Atlantic: Maryland
  • East South Central: Alabama
  • West South Central: Texas
  • Mountain: Arizona
  • Pacific: California
In [24]:
housing = pd.read_csv("/content/HPI_PO_monthly_hist.csv")
housing.index = pd.to_datetime(housing['Month'])
housing = housing.drop(columns=["Month"])
housing = housing.drop(housing.index[0])
housing.head()
Out[24]:
East North Central East South Central Middle Atlantic Mountain New England Pacific South Atlantic West North Central West South Central USA
Month
2020-02-01 232.67 260.54 249.97 392.87 264.70 327.64 289.09 278.23 291.92 282.50
2020-03-01 235.97 262.93 251.29 397.67 268.38 331.57 290.31 281.34 294.82 285.24
2020-04-01 238.33 265.19 253.43 399.73 270.87 334.22 292.29 284.73 296.96 287.59
2020-05-01 239.57 266.47 255.06 399.92 269.87 332.23 293.87 285.82 297.94 288.36
2020-06-01 243.02 269.12 256.54 404.90 274.06 336.72 296.63 289.66 303.16 291.92

2. Data management/representation + Exploratory data analysis

In [25]:
housing.plot(y="USA")
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe3f0c62e20>

3. Hypothesis testing

In [26]:
first = confirmed[confirmed.index.day == 1]
first = first.drop(first.index[[33, 34,32]])
hypo = pd.concat([housing, first], axis=1)
hypo = hypo.rename(columns={'East North Central': 'East_North','East South Central': 'East_South','Middle Atlantic': 'Middle_Atlantic','New England': 'New_England','South Atlantic': 'South_Atlantic', 'West North Central': 'West_North', 'West South Central': 'West_South', 'New York': 'New_York'})
hypo.head()
Out[26]:
East_North East_South Middle_Atlantic Mountain New_England Pacific South_Atlantic West_North West_South USA Maryland Maine New_York Wisconsin Kansas Alabama Texas Arizona California
2020-02-01 232.67 260.54 249.97 392.87 264.70 327.64 289.09 278.23 291.92 282.50 0 0 0 0 0 0 0 2 6
2020-03-01 235.97 262.93 251.29 397.67 268.38 331.57 290.31 281.34 294.82 285.24 0 0 0 0 0 0 0 2 38
2020-04-01 238.33 265.19 253.43 399.73 270.87 334.22 292.29 284.73 296.96 287.59 1986 606 180754 3112 970 2350 7046 2826 18530
2020-05-01 239.57 266.47 255.06 399.92 269.87 332.23 293.87 285.82 297.94 288.36 23472 2246 625598 14628 9268 14880 49248 15938 105092
2020-06-01 243.02 269.12 256.54 404.90 274.06 336.72 296.63 289.66 303.16 291.92 53327 4698 749636 37086 19840 37050 108604 40258 232608
In [27]:
import seaborn as sns
hypo_state = ["Maine", "New_York", "Wisconsin", "Kansas", "Maryland", "Alabama", "Texas", "Arizona", "California"]
hpi = ["New_England", "Middle_Atlantic", "East_North", "West_North", "South_Atlantic", "East_South", "West_South", "Mountain", "Pacific"]

#fig, axes = plt.subplots(9, 1, figsize=(10, 50))

for s, h in zip(hypo_state, hpi):
  b = np.array(hypo[s].values)
  a = np.array(hypo[h].values.reshape(-1, 1))
  m = linear_model.LinearRegression().fit(a,b)
  print(("coefficient of " + s + ":"))
  print(str(m.score(a,b)))
  formula = h + " ~ " + s
  result = o(formula = formula, data = hypo).fit()
  print (result.summary())
  plt.plot(a,b,'k.')
  plt.plot(a, m.predict(a))
  plt.show()
coefficient of Maine:
0.864910943785443
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            New_England   R-squared:                       0.865
Model:                            OLS   Adj. R-squared:                  0.860
Method:                 Least Squares   F-statistic:                     192.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.41e-14
Time:                        12:06:19   Log-Likelihood:                -129.80
No. Observations:                  32   AIC:                             263.6
Df Residuals:                      30   BIC:                             266.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    290.2138      3.528     82.250      0.000     283.008     297.420
Maine          0.0002   1.31e-05     13.859      0.000       0.000       0.000
==============================================================================
Omnibus:                        2.406   Durbin-Watson:                   0.119
Prob(Omnibus):                  0.300   Jarque-Bera (JB):                1.269
Skew:                          -0.054   Prob(JB):                        0.530
Kurtosis:                       2.031   Cond. No.                     3.73e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.73e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of New_York:
0.9242582020415863
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        Middle_Atlantic   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     366.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           2.32e-18
Time:                        12:06:19   Log-Likelihood:                -113.68
No. Observations:                  32   AIC:                             231.4
Df Residuals:                      30   BIC:                             234.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    262.3659      2.387    109.909      0.000     257.491     267.241
New_York      7.5e-06   3.92e-07     19.133      0.000     6.7e-06     8.3e-06
==============================================================================
Omnibus:                        6.631   Durbin-Watson:                   0.245
Prob(Omnibus):                  0.036   Jarque-Bera (JB):                2.019
Skew:                          -0.018   Prob(JB):                        0.364
Kurtosis:                       1.770   Cond. No.                     9.43e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.43e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Wisconsin:
0.9380086401061278
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             East_North   R-squared:                       0.938
Model:                            OLS   Adj. R-squared:                  0.936
Method:                 Least Squares   F-statistic:                     453.9
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.14e-19
Time:                        12:06:19   Log-Likelihood:                -107.87
No. Observations:                  32   AIC:                             219.7
Df Residuals:                      30   BIC:                             222.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    243.9571      2.023    120.608      0.000     239.826     248.088
Wisconsin   2.227e-05   1.05e-06     21.306      0.000    2.01e-05    2.44e-05
==============================================================================
Omnibus:                        3.309   Durbin-Watson:                   0.333
Prob(Omnibus):                  0.191   Jarque-Bera (JB):                1.541
Skew:                           0.140   Prob(JB):                        0.463
Kurtosis:                       1.962   Cond. No.                     3.04e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.04e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Kansas:
0.9462160951861599
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             West_North   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     527.8
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.35e-20
Time:                        12:06:20   Log-Likelihood:                -109.99
No. Observations:                  32   AIC:                             224.0
Df Residuals:                      30   BIC:                             226.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    290.6082      2.154    134.895      0.000     286.208     295.008
Kansas       5.37e-05   2.34e-06     22.974      0.000    4.89e-05    5.85e-05
==============================================================================
Omnibus:                        2.513   Durbin-Watson:                   0.368
Prob(Omnibus):                  0.285   Jarque-Bera (JB):                1.308
Skew:                           0.080   Prob(JB):                        0.520
Kurtosis:                       2.023   Cond. No.                     1.45e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.45e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Maryland:
0.966012441850216
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         South_Atlantic   R-squared:                       0.966
Model:                            OLS   Adj. R-squared:                  0.965
Method:                 Least Squares   F-statistic:                     852.7
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           1.37e-23
Time:                        12:06:20   Log-Likelihood:                -115.60
No. Observations:                  32   AIC:                             235.2
Df Residuals:                      30   BIC:                             238.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    295.0244      2.611    112.997      0.000     289.692     300.357
Maryland       0.0001   4.15e-06     29.201      0.000       0.000       0.000
==============================================================================
Omnibus:                        1.138   Durbin-Watson:                   0.400
Prob(Omnibus):                  0.566   Jarque-Bera (JB):                1.126
Skew:                           0.365   Prob(JB):                        0.569
Kurtosis:                       2.441   Cond. No.                     1.00e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large,  1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Alabama:
0.9716002661382498
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             East_South   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.971
Method:                 Least Squares   F-statistic:                     1026.
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           9.23e-25
Time:                        12:06:20   Log-Likelihood:                -105.61
No. Observations:                  32   AIC:                             215.2
Df Residuals:                      30   BIC:                             218.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    268.0927      1.916    139.922      0.000     264.180     272.006
Alabama     3.889e-05   1.21e-06     32.037      0.000    3.64e-05    4.14e-05
==============================================================================
Omnibus:                        1.053   Durbin-Watson:                   0.598
Prob(Omnibus):                  0.591   Jarque-Bera (JB):                0.891
Skew:                           0.139   Prob(JB):                        0.640
Kurtosis:                       2.231   Cond. No.                     2.52e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Texas:
0.9699563858041998
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             West_South   R-squared:                       0.970
Model:                            OLS   Adj. R-squared:                  0.969
Method:                 Least Squares   F-statistic:                     968.5
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           2.15e-24
Time:                        12:06:20   Log-Likelihood:                -107.88
No. Observations:                  32   AIC:                             219.8
Df Residuals:                      30   BIC:                             222.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    299.3697      2.055    145.675      0.000     295.173     303.567
Texas       7.722e-06   2.48e-07     31.122      0.000    7.22e-06    8.23e-06
==============================================================================
Omnibus:                        2.863   Durbin-Watson:                   0.567
Prob(Omnibus):                  0.239   Jarque-Bera (JB):                1.415
Skew:                          -0.113   Prob(JB):                        0.493
Kurtosis:                       1.995   Cond. No.                     1.32e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of Arizona:
0.947586025218309
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Mountain   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.946
Method:                 Least Squares   F-statistic:                     542.4
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           9.17e-21
Time:                        12:06:20   Log-Likelihood:                -134.47
No. Observations:                  32   AIC:                             272.9
Df Residuals:                      30   BIC:                             275.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    408.5110      4.718     86.578      0.000     398.875     418.147
Arizona     4.528e-05   1.94e-06     23.289      0.000    4.13e-05    4.92e-05
==============================================================================
Omnibus:                        0.078   Durbin-Watson:                   0.319
Prob(Omnibus):                  0.962   Jarque-Bera (JB):                0.295
Skew:                          -0.020   Prob(JB):                        0.863
Kurtosis:                       2.532   Cond. No.                     3.88e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.88e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
coefficient of California:
0.8944032883329911
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Pacific   R-squared:                       0.894
Model:                            OLS   Adj. R-squared:                  0.891
Method:                 Least Squares   F-statistic:                     254.1
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           3.44e-16
Time:                        12:06:20   Log-Likelihood:                -132.95
No. Observations:                  32   AIC:                             269.9
Df Residuals:                      30   BIC:                             272.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    345.7579      4.357     79.362      0.000     336.860     354.656
California  6.247e-06   3.92e-07     15.941      0.000    5.45e-06    7.05e-06
==============================================================================
Omnibus:                        0.409   Durbin-Watson:                   0.229
Prob(Omnibus):                  0.815   Jarque-Bera (JB):                0.552
Skew:                          -0.076   Prob(JB):                        0.759
Kurtosis:                       2.375   Cond. No.                     1.72e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

We can see that the coefficient for 9 states are all greater than 0.7, which tell us that the change in House Price Index (HPI) and number of confirmed in each state has strong relationship.

For the Maine. The $R^2$ is 0.865. So, 86.5% of the HPI can explain by the # comfirmed. The p-value of the Intercept in very low(lower than default $\alpha$ 0.05) which indecates that Intercept in significant in the model. We should keep $\beta_0$ value in the equation.

The p-value of the # comfirmed in Maine in low(lower than default $\alpha$ 0.05) which indecates that number of confirmed is significant in the model and we should keep $\beta_1$ value in the equation.

Assumption: The data is a random sample from the normal distribution. Based on the plot there is linear relationship. The dependent variable is continuous variable.

Base on the information above, the model is reliable and can be used.

5. Insights attained

Motivation: each tutorial should be sufficiently motivated. If there is not motivation for the analysis, why would we ‘do data science’ on this topic?

Understanding: the reader of the tutorial should walk away with some new understanding of the topic at hand. If it’s not possible for a reader to state ‘what they learned’ from reading your tutorial, then why do the analysis?

Resources: tutorials should help the reader learn a skill, but they should also provide a launching pad for the reader to further develop that skill. The tutorial should link to additional resources wherever appropriate, so that a well-motivated reader can read further on techniques that have been used in the tutorial.

Prose: it’s very easy to write the literal English for what the Python code is doing, but that’s not very useful. The prose should enhance, the tutorial, adding additional context and insight.

Code: code should be clear and commented. Function definitions should be described and given context/motivation. If the prose helps the reader understand why you’ve written the code, the comments in the code should be sufficient for the reader to learn how.

Pipeline: all stages of the pipeline should be discussed. We will be looking for ‘good science’, with discussion of each stage and what it’s implications/consequences are.

Communication of Approach: every technical choice has alternatives, why did you choose the approach taken in the tutorial? A reader should walk away with some idea of what the trade-offs may be.

Formatting and Subjective Evaluation: does the tutorial seem polished and ‘publishable’, or haphazard and quickly thrown together? The tutorials should read as well put-together and having undergone a few iterations of editing and refinement. This should be the easiest of the dimensions.

In [28]:
# To HTML:
# !jupyter nbconvert --to html /content/CMSC320_Final.ipynb